{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fine Tuning GPT-3.5-Turbo\n",
    "\n",
    "In this notebook, we walk through an example of fine-tuning gpt-3.5-turbo.\n",
    "\n",
    "Specifically, we attempt to distill GPT-4's knowledge, by generating training data with GPT-4 to then fine-tune GPT-3.5.\n",
    "\n",
    "All training data is generated using two different sections of our index data, creating both a training and evalution set.\n",
    "\n",
    "Evaluation is done using the `ragas` library, which we will detail later on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install llama-index pypdf sentence-transformers ragas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import openai\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n",
    "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Setup\n",
    "\n",
    "Here, we first down load the PDF that we will use to generate training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100 20.7M  100 20.7M    0     0  16.7M      0  0:00:01  0:00:01 --:--:-- 16.8M\n"
     ]
    }
   ],
   "source": [
    "!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is generating a training and eval dataset.\n",
    "\n",
    "We will generate 40 questions on different sections of the PDF we downloaded.\n",
    "\n",
    "We can use GPT-3.5 on the eval questions to get our baseline performance.\n",
    "\n",
    "Then, we will use GPT-4 on the train questions to generate our training data. The training data will be collected with out `OpenAIFineTuningHandler`.\n",
    "\n",
    "This step is entirely optional if you don't want to spend the time/tokens -- the eval and training questions are also provided in this folder, as well as the training data!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import SimpleDirectoryReader, ServiceContext\n",
    "from llama_index.llms import OpenAI\n",
    "from llama_index.evaluation import DatasetGenerator\n",
    "\n",
    "documents = SimpleDirectoryReader(\n",
    "    input_files=[\"IPCC_AR6_WGII_Chapter03.pdf\"]\n",
    ").load_data()\n",
    "\n",
    "# Shuffle the documents\n",
    "import random\n",
    "\n",
    "random.seed(42)\n",
    "random.shuffle(documents)\n",
    "\n",
    "gpt_35_context = ServiceContext.from_defaults(\n",
    "    llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.3)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "question_gen_query = (\n",
    "    \"You are a Teacher/ Professor. Your task is to setup \"\n",
    "    \"a quiz/examination. Using the provided context from a \"\n",
    "    \"report on climate change and the oceans, formulate \"\n",
    "    \"a single question that captures an important fact from the \"\n",
    "    \"context. Restrict the question to the context information provided.\"\n",
    ")\n",
    "\n",
    "dataset_generator = DatasetGenerator.from_documents(\n",
    "    documents[:50],\n",
    "    question_gen_query=question_gen_query,\n",
    "    service_context=gpt_35_context,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated  40  questions\n"
     ]
    }
   ],
   "source": [
    "# NOTE: this may take some time. Go grab a coffee!\n",
    "questions = dataset_generator.generate_questions_from_nodes(num=40)\n",
    "print(\"Generated \", len(questions), \" questions\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"train_questions.txt\", \"w\") as f:\n",
    "    for question in questions:\n",
    "        f.write(question + \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Eval Generation\n",
    "\n",
    "Now, lets generate questions on a completely different set of documents, in order to create our eval dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_generator = DatasetGenerator.from_documents(\n",
    "    documents[\n",
    "        50:\n",
    "    ],  # since we generated ~1 question for 40 documents, we can skip the first 40\n",
    "    question_gen_query=question_gen_query,\n",
    "    service_context=gpt_35_context,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated  40  questions\n"
     ]
    }
   ],
   "source": [
    "# NOTE: this may take some time. Go grab a coffee!\n",
    "questions = dataset_generator.generate_questions_from_nodes(num=40)\n",
    "print(\"Generated \", len(questions), \" questions\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"eval_questions.txt\", \"w\") as f:\n",
    "    for question in questions:\n",
    "        f.write(question + \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initial Eval with GPT-3.5-Turbo Query Engine\n",
    "\n",
    "For this eval, we will be using the [`ragas` evaluation library](https://github.com/explodinggradients/ragas).\n",
    "\n",
    "Ragas has a ton of evaluation metrics for RAG pipelines, and you can read about them [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md).\n",
    "\n",
    "For this notebook, we will be using the following two metrics\n",
    "\n",
    "- `answer_relevancy` - This measures how relevant is the generated answer to the prompt. If the generated answer is incomplete or contains redundant information the score will be low. This is quantified by working out the chance of an LLM generating the given question using the generated answer. Values range (0,1), higher the better.\n",
    "- `faithfulness` - This measures the factual consistency of the generated answer against the given context. This is done using a multi step paradigm that includes creation of statements from the generated answer followed by verifying each of these statements against the context. The answer is scaled to (0,1) range. Higher the better."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "questions = []\n",
    "with open(\"eval_questions.txt\", \"r\") as f:\n",
    "    for line in f:\n",
    "        questions.append(line.strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex\n",
    "\n",
    "# limit the context window to 2048 tokens so that refine is used\n",
    "gpt_35_context = ServiceContext.from_defaults(\n",
    "    llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.3), context_window=2048\n",
    ")\n",
    "\n",
    "index = VectorStoreIndex.from_documents(documents, service_context=gpt_35_context)\n",
    "\n",
    "query_engine = index.as_query_engine(similarity_top_k=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "contexts = []\n",
    "answers = []\n",
    "\n",
    "for question in questions:\n",
    "    response = query_engine.query(question)\n",
    "    contexts.append([x.node.get_content() for x in response.source_nodes])\n",
    "    answers.append(str(response))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/loganmarkewich/llama_index/llama-index/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "evaluating with [answer_relevancy]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/3 [00:00<?, ?it/s]Connection error caused failure to post http://localhost:1984/runs  in LangSmith API. Please confirm your LANGCHAIN_ENDPOINT.\n",
      " 67%|██████▋   | 2/3 [00:43<00:21, 21.71s/it]Connection error caused failure to patch http://localhost:1984/runs/11ab063b-c643-4ebc-b717-9e756d55b7c0  in LangSmith API. Please confirm your LANGCHAIN_ENDPOINT.\n",
      "100%|██████████| 3/3 [00:55<00:00, 18.63s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "evaluating with [faithfulness]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [03:31<00:00, 70.48s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'ragas_score': 0.8576, 'answer_relevancy': 0.9778, 'faithfulness': 0.7638}\n"
     ]
    }
   ],
   "source": [
    "from datasets import Dataset\n",
    "from ragas import evaluate\n",
    "from ragas.metrics import answer_relevancy, faithfulness\n",
    "\n",
    "ds = Dataset.from_dict(\n",
    "    {\n",
    "        \"question\": questions,\n",
    "        \"answer\": answers,\n",
    "        \"contexts\": contexts,\n",
    "    }\n",
    ")\n",
    "\n",
    "result = evaluate(ds, [answer_relevancy, faithfulness])\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GPT-4 to Collect Training Data\n",
    "\n",
    "Here, we use GPT-4 and the `OpenAIFineTuningHandler` to collect data that we want to train on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import ServiceContext\n",
    "from llama_index.llms import OpenAI\n",
    "from llama_index.callbacks import OpenAIFineTuningHandler\n",
    "from llama_index.callbacks import CallbackManager\n",
    "\n",
    "finetuning_handler = OpenAIFineTuningHandler()\n",
    "callback_manager = CallbackManager([finetuning_handler])\n",
    "\n",
    "gpt_4_context = ServiceContext.from_defaults(\n",
    "    llm=OpenAI(model=\"gpt-4\", temperature=0.3),\n",
    "    context_window=2048,  # limit the context window artifically to test refine process\n",
    "    callback_manager=callback_manager,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "questions = []\n",
    "with open(\"train_questions.txt\", \"r\") as f:\n",
    "    for line in f:\n",
    "        questions.append(line.strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex\n",
    "\n",
    "index = VectorStoreIndex.from_documents(documents, service_context=gpt_4_context)\n",
    "\n",
    "query_engine = index.as_query_engine(similarity_top_k=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "for question in questions:\n",
    "    response = query_engine.query(question)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Fine-Tuning Data\n",
    "\n",
    "Fine-Tuning data must be written as a list of messages in a `.jsonl` file. Using the finetuning-handler, we can easily write the messages to a `.jsonl` file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote 61 examples to finetuning_events.jsonl\n"
     ]
    }
   ],
   "source": [
    "finetuning_handler.save_finetuning_events(\"finetuning_events.jsonl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launch Fine-Tuning Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Num examples: 61\n",
      "First example:\n",
      "{'role': 'system', 'content': \"You are an expert Q&A system that is trusted around the world.\\nAlways answer the query using the provided context information, and not prior knowledge.\\nSome rules to follow:\\n1. Never directly reference the given context in your answer.\\n2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines.\"}\n",
      "{'role': 'user', 'content': 'Context information is below.\\n---------------------\\npage_label: 410\\nfile_name: IPCC_AR6_WGII_Chapter03.pdf\\n\\nIt is challenging to apply this experimental approach to communities or ecosystems (see Figure \\nBox\\xa03.1.1).To date, most research on community or ecosystem response to climate-induced drivers has been in large-volume (>10,000 l) \\nmesocosms (Riebesell and Gattuso, 2014), or at natural analogues such as CO 2 seeps, in which only one driver (ocean acidification) is \\naltered (see (4) in Figure Box\\xa03.1.1).Only very recently have two drivers been incorporated into climate-change manipulation studies \\nexamining responses of primary producers to secondary consumers (see (5) in Figure Box\\xa03.1.1a; Nagelkerken et\\xa0al., 2020).Therefore, \\n‘natural experiments’ from the geological past (Reddin et\\xa0al., 2020) provide insights into how food webs and their constituents respond to \\ncomplex change involving multiple drivers.Contemporary observations are occasionally long enough (>50\\xa0years) to capture community \\nresponses to complex climate change.For example, Brun et\\xa0al.(2019) reported a shift in zooplankton community structure in the North \\nAtlantic (1960–2014), with major biogeochemical ramifications.Conducting sufficiently long manipulation experiments to study the effect of adaptation on organisms is equally difficult (see Figure \\nBox\\xa03.1.1b), with much research restricted to multi-year studies of the microevolution of fast-growing (more than one division per day) \\nphytoplankton species responding to single drivers (Lohbeck et\\xa0al., 2012; Schaum et\\xa0al., 2016).In a few experimental evolution studies \\n(see (7) in Figure Box\\xa03.1.1a; Brennan et\\xa0al., 2017), multiple drivers have been used, but none have used communities or ecosystems (see \\nFigure Box\\xa03.1.1b).Nevertheless, the fossil record provides limited evidence of adaptations to less rapid (relative to present day) climate \\nchange (Jackson et\\xa0al., 2018).Despite the need to explore ecological or biogeochemical responses to projected future ocean conditions, \\nlogistical challenges require that assessments of climate-change impacts at scales larger than mesocosms use large-scale, long-term in \\nsitu observational studies (as documented in Section\\xa03.4).\\n\\npage_label: 409\\nfile_name: IPCC_AR6_WGII_Chapter03.pdf\\n\\n3\\n409Oceans and Coastal Ecosystems and Their Services  Chapter 3\\nunderlies inhibited thermal adaptation under nitrogen-limited \\nconditions (low confidence) (Aranguren-Gassis et\\xa0 al., 2019).When \\nselection is strong due to unfavourable environmental conditions, \\nmicrobial populations can encounter functional and evolutionary \\ntrade-offs evidenced by reducing growth rates while increasing \\ntolerance and metabolism of reactive oxygen species (Lindberg and \\nCollins, 2020).Other trade-offs can be observed in offspring quality \\nand number (Lindberg and Collins, 2020).These findings contribute \\ntowards a mechanistic framework describing the range of evolutionary \\nstrategies in response to multiple drivers (Collins et\\xa0al., 2020), but other \\nhazards, such as extreme events (e.g., MHWs), still need to be included \\nbecause their characteristics may alter the potential for adaptation of \\nspecies and populations to climate change (Gruber et\\xa0al., 2021).3.3.5 Ecological Response to Multiple Drivers\\nAssessing ecological responses to multiple climate-induced drivers \\nrequires a combination of approaches, including laboratory- and \\nfield-based experiments, field observations (e.g., natural gradients, \\nclimate analogues), study of paleo-analogues and the development \\nof mechanistic and empirical models (Clapham, 2019; Gissi et\\xa0 al., \\n2021).Experimental studies of food-web responses are often limited \\nto an individual driver, although recent manipulations have used a \\nmatrix of >1000-l mesocosms to explore ecological responses to both \\nwarming and acidification (see Box\\xa0 3.1; Nagelkerken et\\xa0 al., 2020).Hence, complementary approaches are needed to indirectly explore \\nthe mechanisms underlying ecosystem responses to global climate \\nchange (Parmesan et\\xa0al., 2013).Observations from time series longer \\nthan modes of natural variability (i.e., decades) are essential for \\nrevealing and attributing ecological responses to climate change (e.g., \\nSection\\xa03.4; Barton et\\xa0al., 2015b; Brun et\\xa0al., 2019).Also, paleorecords \\nprovide insights into the influence of multiple drivers on marine \\nbiota (Cross-Chapter Box\\xa0 PALEO in Chapter\\xa0 1; Reddin et\\xa0 al., 2020).Specifically, associations between vulnerabilities and traits of marine \\nectotherms in laboratory experiments correspond with organismal \\nresponses to ancient hyperthermal events (medium confidence) \\n(Reddin et\\xa0 al., 2020).This corroboration suggests that responses to \\nmultiple drivers inferred from the fossil record can help provide insights \\ninto the future status of functional groups, and hence food webs, under \\nrapid climate change.Multi-species and integrated end-to-end ecosystem models are \\npowerful tools to explore and project outcomes to the often-interacting \\ncumulative effects of climate change and other anthropogenic drivers \\n(Section\\xa03.1; Kaplan and Marshall, 2016; Koenigstein et\\xa0al., 2016; Peck \\nand Pinnegar, 2018; Tittensor et\\xa0 al., 2018; Gissi et\\xa0 al., 2021).These \\nmodels can integrate some aspects of the knowledge accrued from \\nmanipulation experiments, paleo- and contemporary observations, help \\ntest the relative importance of specific drivers and driver combinations, \\nand identify synergistic or antagonistic responses (Koenigstein et\\xa0al., \\n2016; Payne et\\xa0al., 2016; Skogen et\\xa0al., 2018; Tittensor et\\xa0al., 2018).As these models are associated with wide-ranging uncertainties \\n(SM3.2.2; Payne et\\xa0 al., 2016; Trolle et\\xa0 al., 2019; Heneghan et\\xa0 al., \\n2021), they cannot be expected to accurately project the trajectories \\nof complex marine ecosystems under climate change; hence, they are \\nmost useful for assessing overall trends and in particular for providing a plausible envelope of trajectories across a range of assumptions \\n(Fulton et\\xa0al., 2018; Peck et\\xa0al., 2018; Tittensor et\\xa0al., 2018).\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: What are some approaches used to assess ecological responses to multiple climate-induced drivers in the context of climate change and the oceans?\\nAnswer: '}\n",
      "{'role': 'assistant', 'content': 'Several approaches are used to assess ecological responses to multiple climate-induced drivers. These include laboratory- and field-based experiments, field observations such as natural gradients and climate analogues, the study of paleo-analogues, and the development of mechanistic and empirical models. Experimental studies often focus on individual drivers, but recent manipulations have used large-volume mesocosms to explore ecological responses to both warming and acidification. Observations from time series longer than modes of natural variability are essential for revealing and attributing ecological responses to climate change. Paleorecords also provide insights into the influence of multiple drivers on marine biota. Multi-species and integrated end-to-end ecosystem models are powerful tools to explore and project outcomes to the often-interacting cumulative effects of climate change and other anthropogenic drivers. These models can integrate some aspects of the knowledge accrued from manipulation experiments, paleo- and contemporary observations, help test the relative importance of specific drivers and driver combinations, and identify synergistic or antagonistic responses.'}\n",
      "No errors found\n",
      "Num examples missing system message: 21\n",
      "Num examples missing user message: 0\n",
      "\n",
      "#### Distribution of num_messages_per_example:\n",
      "min / max: 2, 3\n",
      "mean / median: 2.6557377049180326, 3.0\n",
      "p5 / p95: 2.0, 3.0\n",
      "\n",
      "#### Distribution of num_total_tokens_per_example:\n",
      "min / max: 229, 2011\n",
      "mean / median: 1274.27868852459, 1385.0\n",
      "p5 / p95: 533.0, 1848.0\n",
      "\n",
      "#### Distribution of num_assistant_tokens_per_example:\n",
      "min / max: 11, 334\n",
      "mean / median: 72.36065573770492, 37.0\n",
      "p5 / p95: 23.0, 193.0\n",
      "\n",
      "0 examples may be over the 4096 token limit, they will be truncated during fine-tuning\n",
      "Dataset has ~77731 tokens that will be charged for during training\n",
      "By default, you'll train for 3 epochs on this dataset\n",
      "By default, you'll be charged for ~233193 tokens\n",
      "As of Augest 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens.\n",
      "This means your total cost for training will be $0.621848 per epoch.\n",
      "File uploaded...\n",
      "Waiting for file to be ready...\n",
      "Training job launched. You will be emailed when it's complete.\n"
     ]
    }
   ],
   "source": [
    "!python ./launch_training.py ./finetuning_events.jsonl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "\n",
    "After some time, your model will be done training!\n",
    "\n",
    "The next step is running our fine-tuned model on our eval dataset again to measure any performance increase."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "ft_model_name = \"ft:gpt-3.5-turbo-0613:...\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import ServiceContext\n",
    "from llama_index.llms import OpenAI\n",
    "from llama_index.callbacks import OpenAIFineTuningHandler\n",
    "from llama_index.callbacks import CallbackManager\n",
    "\n",
    "\n",
    "ft_context = ServiceContext.from_defaults(\n",
    "    llm=OpenAI(model=ft_model_name, temperature=0.3),\n",
    "    context_window=2048,  # limit the context window artifically to test refine process\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "questions = []\n",
    "with open(\"eval_questions.txt\", \"r\") as f:\n",
    "    for line in f:\n",
    "        questions.append(line.strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex\n",
    "\n",
    "index = VectorStoreIndex.from_documents(documents, service_context=ft_context)\n",
    "\n",
    "query_engine = index.as_query_engine(similarity_top_k=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "contexts = []\n",
    "answers = []\n",
    "\n",
    "for question in questions:\n",
    "    response = query_engine.query(question)\n",
    "    contexts.append([x.node.get_content() for x in response.source_nodes])\n",
    "    answers.append(str(response))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "evaluating with [answer_relevancy]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [00:50<00:00, 16.92s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "evaluating with [faithfulness]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [03:15<00:00, 65.20s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'ragas_score': 0.8845, 'answer_relevancy': 0.9758, 'faithfulness': 0.8088}\n"
     ]
    }
   ],
   "source": [
    "from datasets import Dataset\n",
    "from ragas import evaluate\n",
    "from ragas.metrics import answer_relevancy, faithfulness\n",
    "\n",
    "ds = Dataset.from_dict(\n",
    "    {\n",
    "        \"question\": questions,\n",
    "        \"answer\": answers,\n",
    "        \"contexts\": contexts,\n",
    "    }\n",
    ")\n",
    "\n",
    "result = evaluate(ds, [answer_relevancy, faithfulness])\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring Differences\n",
    "\n",
    "Let's quickly compare the differences in responses, to demonstrate that fine tuning did indeed change something."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex\n",
    "\n",
    "index = VectorStoreIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "questions = []\n",
    "with open(\"eval_questions.txt\", \"r\") as f:\n",
    "    for line in f:\n",
    "        questions.append(line.strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "What is a key barrier globally for ocean health, governance, and adaptation to climate change according to the report?\n"
     ]
    }
   ],
   "source": [
    "print(questions[12])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Original"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.response.notebook_utils import display_response\n",
    "from llama_index import ServiceContext\n",
    "from llama_index.llms import OpenAI\n",
    "\n",
    "\n",
    "gpt_35_context = ServiceContext.from_defaults(\n",
    "    llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.3),\n",
    "    context_window=2048,  # limit the context window artifically to test refine process\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**`Final Response:`** According to the report, a key barrier globally for ocean health, governance, and adaptation to climate change is the availability of technology, knowledge, and financial support, as well as existing governance structures."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "query_engine = index.as_query_engine(service_context=gpt_35_context)\n",
    "\n",
    "response = query_engine.query(questions[12])\n",
    "\n",
    "display_response(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fine-Tuned"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import ServiceContext\n",
    "from llama_index.llms import OpenAI\n",
    "\n",
    "\n",
    "ft_context = ServiceContext.from_defaults(\n",
    "    llm=OpenAI(model=ft_model_name, temperature=0.3),\n",
    "    context_window=2048,  # limit the context window artifically to test refine process\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**`Final Response:`** The report identifies a broad range of barriers and limits for adaptation to climate change in ecosystems and human systems. These limitations include the availability of technology, knowledge, and financial support, as well as existing governance structures. Existing ocean-governance structures are already facing multi-dimensional, scale-related challenges because of climate change."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "query_engine = index.as_query_engine(service_context=ft_context)\n",
    "\n",
    "response = query_engine.query(questions[12])\n",
    "\n",
    "display_response(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the fine-tuned model provides a more thorough response! This lines up with the increased faithfullness score from ragas, since the answer is more representative of the retrieved context."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "So, in conclusion, finetuning with only ~61 questions actually helped improve our eval scores!\n",
    "\n",
    "**answer_relevancy: 0.9778 -> 0.9758**\n",
    "\n",
    "The answer relenvancy appears to be basically unchanged, between models.\n",
    "\n",
    "**faithfulness: 0.7638 -> 0.8088**\n",
    "\n",
    "The faithfulness appears to have been improved! This mains the anwers given better fuffil the original question that was asked."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
